Análisis de los datos de Airbnb de la ciudad de Barcelona
Guardando la Api key de Google
[1] TRUE
Cargamos los datos de Barcelona para el análisis
Limpiamos los datos
test <- capture.output(str_trim(mll_data$neighbourhood_cleansed, side=c("both")))
mll_data$price=as.numeric(gsub("\\$","",mll_data$price))
Ahora, centremos la atencion en Barcelona y analicemos la ubicación y los precios
Definiremos los datos que queremos representar en el mapa: La densidad de las propiedades en Airbnb
map_data = mll_data[(mll_data$neighbourhood_cleansed == "el Raval"), ]
Obtenemos en mapa de Barcelona
MapStamen = qmap("barcelona", source = "stamen", zoom = 14, color = "bw")
MapGoogle=ggmap(
ggmap = get_map(
"Barcelona",
zoom = 14, scale = "auto",
maptype = "terrain",
source = "google"),
extent = "device",
legend = "topright"
)
Creemos un mapa que muestre las zonas de Barcelona y las intensidades, según el precio
bar_circle_size = 0.010
MapGoogle + geom_point(data = map_data, aes(x = longitude, y = latitude), colour = "red", size = map_data$price*bar_circle_size, alpha = 0.5)

Podemos usar un mapa para mostrar las intensidades, según la ubicación:
MapGoogle +
stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
size = 1, data = map_data, geom = "polygon")

Podemos extender el análisis para el caso de Barcelona
MapGoogle=ggmap(
ggmap = get_map(
"Barcelona",
zoom = 13, scale = "auto",
maptype = "terrain",
source = "google"),
extent = "device",
legend = "topright"
)
MapGoogle +
stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
size = 2, data = mll_data, geom = "polygon")

Crearemos un mapa que muestre las zonas de Barcelona y las intensidades, según el precio
mll_circle_size = 0.005
MapGoogle + geom_point(data = mll_data, aes(x = longitude, y = latitude), colour = "red", size = mll_data$price*mll_circle_size, alpha = 0.5)

Descripción de los datos antes de ajustar un modelo de regresión lineal que prediga el precio
Examinamos la variable precios
hist(mll_data$price)

mean(mll_data$price, na.rm = TRUE)
[1] 89.62964
El precio puede diferir dependiendo de la localidad. Podemos verificarlo a continuación
neighbourhood_cleansed price
1 Baró de Viver 22.50000
2 Can Baró 60.15385
3 Canyelles 42.50000
4 Ciutat Meridiana 26.90000
5 Diagonal Mar i el Front Marítim del Poblenou 185.94022
6 el Baix Guinardó 64.35659
7 el Barri Gòtic 77.76323
8 el Besòs i el Maresme 52.68317
9 el Bon Pastor 31.72222
10 el Camp d'en Grassot i Gràcia Nova 79.80480
11 el Camp de l'Arpa del Clot 70.62202
12 el Carmel 94.37778
13 el Clot 49.39873
14 el Coll 45.35897
15 el Congrés i els Indians 43.20339
16 el Fort Pienc 111.09091
17 el Guinardó 74.61111
18 el Parc i la Llacuna del Poblenou 79.45082
19 el Poble Sec 81.41600
20 el Poblenou 86.47990
21 el Putxet i el Farró 92.68500
22 el Raval 65.74434
23 el Turó de la Peira 30.85000
24 Horta 38.63889
25 Hostafrancs 89.92800
26 l'Antiga Esquerra de l'Eixample 109.94456
27 la Barceloneta 65.76154
28 la Bordeta 55.44348
29 la Dreta de l'Eixample 158.32805
30 la Font d'en Fargues 74.33333
31 la Font de la Guatlla 61.61789
32 la Guineueta 64.75000
33 la Marina de Port 83.38462
34 la Marina del Prat Vermell 112.11111
35 la Maternitat i Sant Ramon 50.36301
36 la Nova Esquerra de l'Eixample 78.39239
37 la Prosperitat 35.68750
38 la Sagrada Família 92.59495
39 la Sagrera 43.91304
40 la Salut 63.22549
41 la Teixonera 35.62500
42 la Trinitat Nova 29.50000
43 la Trinitat Vella 40.60000
44 la Vall d'Hebron 38.50000
45 la Verneda i la Pau 40.62903
46 la Vila de Gràcia 85.98637
47 la Vila Olímpica del Poblenou 154.28070
48 les Corts 78.28636
49 les Roquetes 35.36364
50 les Tres Torres 110.14706
51 Montbau 26.18750
52 Navas 38.29487
53 Pedralbes 105.94737
54 Porta 32.20513
55 Provençals del Poblenou 62.53125
56 Sant Andreu 44.47458
57 Sant Antoni 89.98194
58 Sant Genís dels Agudells 60.85714
59 Sant Gervasi - Galvany 102.67732
60 Sant Gervasi - la Bonanova 65.02326
61 Sant Martí de Provençals 42.41667
62 Sant Pere, Santa Caterina i la Ribera 72.20000
63 Sants 71.81073
64 Sants - Badal 61.76623
65 Sarrià 67.60606
66 Torre Baró 39.75000
67 Vallbona 30.00000
68 Vallcarca i els Penitents 66.92381
69 Vallvidrera, el Tibidabo i les Planes 119.60000
70 Verdun 29.24000
71 Vilapicina i la Torre Llobeta 39.50943
O dependiendo del tipo de propiedad
table(mll_data$property_type)
Aparthotel Apartment Bed and breakfast Boat
44 14980 229 48
Boutique hotel Cabin Camper/RV Casa particular (Cuba)
112 1 3 43
Cave Chalet Condominium Dome house
2 2 364 2
Dorm Farm stay Guest suite Guesthouse
2 2 151 31
Hostel Hotel House Loft
137 42 334 421
Other Serviced apartment Tiny house Townhouse
92 663 4 35
Villa
19
Podemos examinar el precio según el tipo de propiedad, examinemos la asociacion entre precios y los evaluaciones de los consumidores
aggregate(price ~ property_type, data=mll_data, FUN=mean)
property_type price
1 Aparthotel 119.84091
2 Apartment 86.17323
3 Bed and breakfast 45.88053
4 Boat 241.23256
5 Boutique hotel 220.41237
6 Cabin 25.00000
7 Camper/RV 116.66667
8 Casa particular (Cuba) 44.80488
9 Cave 45.00000
10 Chalet 125.00000
11 Condominium 90.99725
12 Dome house 94.00000
13 Dorm 65.00000
14 Farm stay 58.50000
15 Guest suite 55.04636
16 Guesthouse 52.48387
17 Hostel 91.00000
18 Hotel 319.68750
19 House 76.31420
20 Loft 84.23278
21 Other 83.90110
22 Serviced apartment 166.95735
23 Tiny house 67.50000
24 Townhouse 116.37143
25 Villa 219.00000
aggregate(price ~ review_scores_value, data=mll_data, FUN=mean)
review_scores_value price
1 2 92.48000
2 3 47.50000
3 4 75.58824
4 5 77.93103
5 6 87.98605
6 7 116.84419
7 8 101.66025
8 9 94.17351
9 10 76.28139
cor(mll_data$price, mll_data$review_scores_value, use="pairwise.complete.obs")
[1] -0.07111042
cor(mll_data$price, mll_data$review_scores_rating, use="pairwise.complete.obs")
[1] -0.03170615
aggregate(price ~ review_scores_rating, data=mll_data, FUN=mean)
review_scores_rating price
1 20 88.02439
2 30 45.00000
3 35 150.00000
4 40 95.31111
5 47 61.00000
6 50 72.53333
7 51 80.00000
8 52 51.00000
9 53 124.33333
10 54 30.00000
11 55 82.25000
12 56 450.00000
13 57 82.00000
14 59 160.00000
15 60 83.53416
16 61 48.00000
17 62 85.00000
18 63 139.55556
19 64 134.77778
20 65 153.00000
21 66 36.00000
22 67 121.28205
23 68 48.44444
24 69 191.54545
25 70 84.24691
26 71 138.50000
27 72 113.09524
28 73 78.46875
29 74 87.27273
30 75 107.74510
31 76 100.53333
32 77 90.07463
33 78 113.40580
34 79 70.39024
35 80 91.53138
36 81 106.60000
37 82 100.55752
38 83 101.78261
39 84 95.33065
40 85 88.26255
41 86 101.76667
42 87 98.19737
43 88 90.32972
44 89 91.86161
45 90 89.11321
46 91 89.45518
47 92 100.10476
48 93 88.30426
49 94 88.55067
50 95 88.53890
51 96 87.82254
52 97 84.78470
53 98 87.02399
54 99 98.80780
55 100 80.26882
Ajustemos un modelo lineal de regresión para predecir el precio
Crearemos la muestra de entrenamiento y de prueba
spl = sample.split(mll_data$price, 0.8)
train = subset(mll_data, spl == TRUE)
test = subset(mll_data, spl == FALSE)
mean(train$price, na.rm = TRUE)
[1] 90.32812
mean(test$price, na.rm = TRUE)
[1] 86.81302
Ajustamos el modelo
model_lm = lm(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude, data=train)
summary(model_lm)
Call:
lm(formula = price ~ bedrooms + review_scores_value + number_of_reviews +
square_feet + longitude + latitude, data = train)
Residuals:
Min 1Q Median 3Q Max
-288.18 -48.05 -17.00 20.63 866.50
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.480e+04 1.959e+04 -0.755 0.450470
bedrooms 4.521e+01 4.958e+00 9.119 < 2e-16 ***
review_scores_value -9.997e+00 7.487e+00 -1.335 0.182592
number_of_reviews -1.816e-01 5.301e-02 -3.426 0.000677 ***
square_feet 2.748e-02 1.056e-02 2.603 0.009593 **
longitude 6.345e+02 3.534e+02 1.795 0.073365 .
latitude 3.273e+02 4.795e+02 0.683 0.495199
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 110.4 on 386 degrees of freedom
(13733 observations deleted due to missingness)
Multiple R-squared: 0.3094, Adjusted R-squared: 0.2987
F-statistic: 28.82 on 6 and 386 DF, p-value: < 2.2e-16
Calculemos el Error Cuadrado Medio
predict_lm = predict(model_lm, newdata = test)
MSE_lm = mean((predict_lm - test$price)^2, na.rm = TRUE)
MSE_lm
[1] 22315.98
data.frame( R2 = R2(predict_lm, test$price, na.rm = TRUE),
RMSE = RMSE(predict_lm, test$price, na.rm = TRUE),
MAE = MAE(predict_lm, test$price, na.rm = TRUE))
R2 RMSE MAE
1 0.1230678 149.3853 75.977
Examinemos la validación cruzada
Imputaremos los valores NA de las variables numericas. No imputaremos dichos valores en el caso de la longitud y latitud
mice_mod = mice(mll_data[, names(mll_data) %in% c("price", "bedrooms", "review_scores_value", "number_of_reviews", "square_feet" )], method = 'pmm', seed = 500)
iter imp variable
1 1 bedrooms square_feet price review_scores_value
1 2 bedrooms square_feet price review_scores_value
1 3 bedrooms square_feet price review_scores_value
1 4 bedrooms square_feet price review_scores_value
1 5 bedrooms square_feet price review_scores_value
2 1 bedrooms square_feet price review_scores_value
2 2 bedrooms square_feet price review_scores_value
2 3 bedrooms square_feet price review_scores_value
2 4 bedrooms square_feet price review_scores_value
2 5 bedrooms square_feet price review_scores_value
3 1 bedrooms square_feet price review_scores_value
3 2 bedrooms square_feet price review_scores_value
3 3 bedrooms square_feet price review_scores_value
3 4 bedrooms square_feet price review_scores_value
3 5 bedrooms square_feet price review_scores_value
4 1 bedrooms square_feet price review_scores_value
4 2 bedrooms square_feet price review_scores_value
4 3 bedrooms square_feet price review_scores_value
4 4 bedrooms square_feet price review_scores_value
4 5 bedrooms square_feet price review_scores_value
5 1 bedrooms square_feet price review_scores_value
5 2 bedrooms square_feet price review_scores_value
5 3 bedrooms square_feet price review_scores_value
5 4 bedrooms square_feet price review_scores_value
5 5 bedrooms square_feet price review_scores_value
mice_output = complete(mice_mod)
Seleccionaremos los datos para el análisis, considerando la data con los valores imputados y las variables donde no se ha hecho tal imputacion
covar=subset(mll_data, select=c("longitude", "latitude", "cancellation_policy"))
data_reg=cbind(covar, mice_output)
set.seed(1991)
train.control = trainControl(method = "cv", number = 10)
Entrenemos un modelo de regresion lineal
model_vc_ols = train(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data = data_reg, method = "lm", trControl = train.control)
print(model_vc_ols)
Linear Regression
17763 samples
7 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 15987, 15988, 15986, 15987, 15987, 15987, ...
Resampling results:
RMSE Rsquared MAE
88.5438 0.2499516 50.2345
Tuning parameter 'intercept' was held constant at a value of TRUE
Veamos si modelos alternativos pueden mejorar la prediccion. En primer lugar, probemos con una regresion de arbol
model_cart = rpart(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data=train, method = "anova", na.action = na.omit)
Calculemos el Error Cuadrado Medio
predict_cart = predict(model_cart, newdata = test)
MSE_cart = mean((predict_cart-test$price)^2, na.rm = TRUE)
MSE_cart
[1] 15765.54
Consideremos la validacion cruzada
model_vc_tree = train(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data = data_reg, method = "rpart", trControl = train.control)
print(model_vc_tree)
CART
17763 samples
7 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 15986, 15988, 15987, 15987, 15987, 15986, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01474493 89.46055 0.2339815 49.48721
0.05368781 92.03803 0.1889530 51.03603
0.17219969 97.87486 0.1547440 56.56656
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01474493.
Podemos dibujar el arbol
prp(model_cart)

Ahora, probemos con un Random Forest Model
model_rf = randomForest(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data=train, importance= TRUE, method = "anova", na.action = na.omit)
model_rf
Call:
randomForest(formula = price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data = train, importance = TRUE, method = "anova", na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 10987.91
% Var explained: 36.59
predict_rf = predict(model_rf, ntree=4, newdata=test)
MSE_rf = mean((predict_rf-test$price)^2,na.rm = TRUE)
MSE_rf
[1] 17121.35
Grafico importancia de las variables
varImpPlot(model_rf, sort=TRUE, main="Importancia", n.var=5)

Tabla con la importancia de las variables
var.imp=data.frame(importance(model_rf, type=2))
var.imp$Variables=row.names(var.imp)
var.imp[order(var.imp$IncNodePurity, decreasing = TRUE),]
IncNodePurity Variables
bedrooms 1415412.3 bedrooms
number_of_reviews 1047097.7 number_of_reviews
square_feet 1005390.1 square_feet
latitude 920292.6 latitude
longitude 780822.6 longitude
review_scores_value 529607.5 review_scores_value
cancellation_policy 148936.2 cancellation_policy